• DOMAIN: Electronics and Telecommunication

• CONTEXT: A communications equipment manufacturing company has a product which is responsible for emitting informative signals.

The company wants to build a machine learning model that can help predict the equipment’s signal quality using various parameters.

• DATA DESCRIPTION: The data set contains information on various signal tests performed:

1. Parameters: Various measurable signal parameters.

2. Signal_Quality: Final signal strength or quality

• PROJECT OBJECTIVE: To build a classifier which can use the given parameters to determine the signal strength or quality.

1. Data import and Understanding

A. Read the ‘Signals.csv’ as DataFrame and import required libraries.

* WE HAVE A TOTAL OF 12 COLUMNS IN THE DATA SET.

* ALL THE COLUMNS ARE OF NUMERIC DATA TYPE.

* THERE ARE 1599 ROWS OF DATA IN THE IMPORTED FILE.

* THE SIGNAL_STRENGTH COLUMN VALUE SEEMS TO BE DEPENDENT ON PARAMETERS 1 TO 11.

* FROM THE DATA DESCRIPTION, IT SEEMS THAT THERE ARE NO NULL VALUES IN THE DATA.

* PARAMETER 7 SEEMS TO BE HEAVILY RIGHT-SKEWED, FOLLOWED BY PARAMETER 6 AND PARAMETER 4.

B. Check for missing values and print percentage for each attribute.

* THERE ARE NO MISSING VALUES IN THE DATA.
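
The missing-value check above can be sketched as follows; the real notebook would run this on the DataFrame loaded from Signals.csv, so the tiny synthetic frame here is only a stand-in:

```python
import pandas as pd
import numpy as np

# Stand-in for the Signals DataFrame (the real data has 12 numeric columns)
df = pd.DataFrame({
    "Parameter_1": [7.4, 7.8, np.nan, 11.2],
    "Parameter_2": [0.70, 0.88, 0.76, 0.28],
    "Signal_Strength": [5, 5, 5, 6],
})

# Percentage of missing values per attribute
missing_pct = df.isnull().mean() * 100
print(missing_pct)
```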

C. Check for presence of duplicate records in the dataset and impute with appropriate method.

* THERE ARE 240 DUPLICATE VALUES IN THE DATA WHICH IS ALMOST 15% OF THE DATA.

* SINCE THE DATA DESCRIBES SIGNAL STRENGTH AND QUALITY AND IS NUMERIC, WE KEEP ONLY THE FIRST OCCURRENCE OF EACH DUPLICATE AND DROP THE REMAINING ROWS, SO THAT ONLY UNIQUE ROWS REMAIN IN THE DATA.
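
A minimal pandas sketch of the duplicate handling described above, shown on a hypothetical frame standing in for the Signals data:

```python
import pandas as pd

# Hypothetical frame with repeated rows, standing in for the Signals data
df = pd.DataFrame({
    "Parameter_1": [7.4, 7.4, 7.8, 7.4],
    "Signal_Strength": [5, 5, 5, 5],
})

n_dups = df.duplicated().sum()          # rows that repeat an earlier row
df = df.drop_duplicates(keep="first")   # keep the first occurrence only
print(n_dups, df.shape)
```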

D. Visualise distribution of the target variable.

E. Share insights from the initial data analysis (at least 2).

* SIGNAL STRENGTH RANGES BETWEEN 3 AND 8.

* FOR MOST SIGNALS, STRENGTH LIES BETWEEN 5 AND 7.

* THERE ARE NEGATIVE VALUES IN THE CORRELATION MATRIX, INDICATING THAT SOME PARAMETERS ARE WEAKLY NEGATIVELY CORRELATED.

* THERE ARE 240 DUPLICATE ROWS, WHICH WE DROPPED FROM THE DATA SET, KEEPING ONLY THE FIRST OCCURRENCE.

* OUTLIERS ARE PRESENT IN SOME OF THE SIGNAL PARAMETERS.

2. Data preprocessing

A. Split the data into X & Y.

B. Split the data into train & test with 70:30 proportion.

C. Print shape of all the 4 variables and verify if train and test data is in sync.

* THE TARGET VARIABLES Y, Y_TRAIN AND Y_TEST APPEAR TO FOLLOW THE SAME DISTRIBUTION IN THE PLOT.
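
Steps A–C above can be sketched with scikit-learn; the synthetic arrays below stand in for the 11 signal parameters and the Signal_Strength target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 11 signal parameters, one target column
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 11))
y = rng.integers(3, 9, size=100)        # Signal_Strength classes 3..8

# 70:30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Shapes should line up: same row counts for X and y in each split
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```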

D. Normalise the train and test data with appropriate method.

E. Transform Labels into format acceptable by Neural Network
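
Steps D and E can be sketched as below; the scaler choice and the NumPy one-hot encoding (equivalent to `keras.utils.to_categorical` after shifting the 3..8 classes down to 0..5) are illustrative assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(5, 2, size=(70, 11))   # stand-in train features
X_test = rng.normal(5, 2, size=(30, 11))    # stand-in test features
y_train = rng.integers(3, 9, size=70)       # stand-in labels 3..8

# Fit the scaler on train only, then apply the same transform to test,
# so no information from the test set leaks into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# One-hot encode labels: each label becomes a vector whose length is
# the number of unique classes
classes = np.unique(y_train)
y_train_oh = np.eye(len(classes))[np.searchsorted(classes, y_train)]
print(y_train_oh.shape)
```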

3. Model Training & Evaluation using Neural Network

A. Design a Neural Network to train a classifier.

* LET US BUILD A SEQUENTIAL MODEL AND TRAIN IT.

* WE WILL ADD AN INPUT LAYER, HIDDEN LAYERS AND AN OUTPUT LAYER.
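
A minimal Keras sketch of the architecture described above (input, hidden, output); the layer widths and learning rate here are illustrative assumptions, not the notebook's exact design:

```python
from tensorflow import keras

n_features, n_classes = 11, 6   # 11 parameters, quality classes 3..8

# Sequential model: input layer, two hidden layers, softmax output
model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(64, activation="relu"),            # hidden layer
    keras.layers.Dense(32, activation="relu"),            # hidden layer
    keras.layers.Dense(n_classes, activation="softmax"),  # output layer
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.01),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```

The model would then be trained with `model.fit(X_train, y_train_oh, validation_split=..., epochs=...)`.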

B. Train the classifier using previously designed Architecture

C. Plot 2 separate visuals.

i. Training Loss and Validation Loss

ii. Training Accuracy and Validation Accuracy
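
The two visuals in Q3.C can be sketched as below; the `history` dict here holds made-up illustrative numbers standing in for what `model.fit` returns in `history.history`:

```python
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt

history = {                      # illustrative numbers only
    "loss":         [1.6, 1.2, 1.0, 0.9],
    "val_loss":     [1.5, 1.3, 1.2, 1.2],
    "accuracy":     [0.40, 0.52, 0.58, 0.62],
    "val_accuracy": [0.42, 0.50, 0.55, 0.56],
}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for key in ("loss", "val_loss"):          # plot i: losses per epoch
    ax1.plot(history[key], label=key)
ax1.set(title="Loss", xlabel="epoch")
ax1.legend()
for key in ("accuracy", "val_accuracy"):  # plot ii: accuracies per epoch
    ax2.plot(history[key], label=key)
ax2.set(title="Accuracy", xlabel="epoch")
ax2.legend()
fig.savefig("training_curves.png")
```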

D. Design new architecture/update existing architecture in attempt to improve the performance of the model.

* APART FROM THE SGD OPTIMIZER, WE ARE NOT GETTING GOOD RESULTS FROM THE OTHER OPTIMIZERS.

* WE WILL STICK WITH THE SGD OPTIMIZER FOR THE DESIGN.

* THE MODEL SCORE DID NOT IMPROVE WITH DIFFERENT LEARNING RATES.
* BY ADDING ADDITIONAL LAYERS TO THE MODEL, ACCURACY ON BOTH THE TRAINING AND TEST DATA IS GOOD.

* THE LOSS ON TRAINING AND TEST DATA ALSO REDUCED AFTER ADDING ADDITIONAL LAYERS.

* BY ADDING MORE LAYERS, WE CAN EXPECT ACCURACY TO INCREASE AND LOSS TO REDUCE.

* INCREASING THE EPOCHS DOES NOT PRODUCE ANY REAL INCREASE IN TRAINING OR TEST ACCURACY.

* BY INCREASING THE HIDDEN LAYERS, THE MODEL BECOMES MORE STABLE.

* AT A TRAINING ACCURACY OF 76.4%, TEST ACCURACY IS 61.27%.

* BY ADDING MORE AND MORE LAYERS WE CAN STILL INCREASE MODEL PERFORMANCE, BUT THE MODEL MAY OVERFIT.

* WE WILL CONSIDER NN_MODEL5 AS THE UPDATED AND FINAL MODEL.

E. Plot visuals as in Q3.C and share insights about difference observed in both the models.

* INSIGHTS:

    * INCREASING THE NUMBER OF EPOCHS DID NOT IMPROVE THE PERFORMANCE.

    * VARIOUS OPTIMIZERS AND LEARNING RATES DID NOT IMPROVE THE PERFORMANCE EITHER.

    * SIGMOID AND TANH ACTIVATION FUNCTIONS IMPROVED THE PERFORMANCE SLIGHTLY.

    * AS HIDDEN LAYERS ARE ADDED, WE CAN SEE THAT THE MODEL BECOMES MORE STABLE.

    * THE TRAINING DATA GRAPH SHOWS AN INCREASINGLY SMOOTH CURVE.

    * LOSS ON THE TRAINING DATA DECREASED COMPARED TO THE OLD MODEL, AND TRAINING ACCURACY SLIGHTLY INCREASED.

    * WE CAN ADD MORE LAYERS TO THE DESIGN TO FURTHER INCREASE MODEL PERFORMANCE.

• DOMAIN: Autonomous Vehicles

• CONTEXT: Recognising multi-digit numbers in photographs captured at street level is an important component of modern-day map making. A classic example of a corpus of such street-level photographs is Google’s Street View imagery, composed of hundreds of millions of geo-located 360-degree panoramic images.

The ability to automatically transcribe an address number from a geo-located patch of pixels and associate the transcribed number with a known street address helps pinpoint, with a high degree of accuracy, the location of the building it represents. More broadly, recognising numbers in photographs is a problem of interest to the optical character recognition community.

While OCR on constrained domains like document processing is well studied, arbitrary multi-character text recognition in photographs is still highly challenging. This difficulty arises due to the wide variability in the visual appearance of text in the wild on account of a large range of fonts, colours, styles, orientations, and character arrangements.

The recognition problem is further complicated by environmental factors such as lighting, shadows, specularity, and occlusions as well as by image acquisition factors such as resolution, motion, and focus blurs. In this project, we will use the dataset with images centred around a single digit (many of the images do contain some distractors at the sides). Although we are taking a sample of the data which is simpler, it is more complex than MNIST because of the distractors.

• DATA DESCRIPTION: The SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with the minimal requirement on data formatting but comes from a significantly harder, unsolved, real-world problem (recognising digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.

The labels for each of these images are the prominent number in that image, i.e. 2, 6, 7 and 4 respectively.

The dataset has been provided in the form of h5py files. You can read about this file format here: https://docs.h5py.org/en/stable/

Acknowledgement: Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011. PDF and dataset: http://ufldl.stanford.edu/housenumbers

• PROJECT OBJECTIVE: To build a digit classifier on the SVHN (Street View Housing Number) dataset.

Steps and tasks:

1. Data Import and Exploration

A. Read the .h5 file and assign to a variable.

B. Print all the keys from the .h5 file.

C. Split the data into X_train, X_test, Y_train, Y_test
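
Steps A–C can be sketched with h5py; since the real SVHN file is not available here, a tiny stand-in file is created first, with key names assumed to mirror the dataset:

```python
import h5py
import numpy as np

# Create a tiny stand-in .h5 file (the real file would be the SVHN data)
with h5py.File("svhn_demo.h5", "w") as f:
    f.create_dataset("X_train", data=np.zeros((4, 32, 32)))
    f.create_dataset("y_train", data=np.array([2, 6, 7, 4]))
    f.create_dataset("X_test", data=np.zeros((2, 32, 32)))
    f.create_dataset("y_test", data=np.array([1, 0]))

# A. Read the .h5 file and assign it to a variable
h5f = h5py.File("svhn_demo.h5", "r")

# B. Print all the keys in the file
print(list(h5f.keys()))

# C. Split into train and test arrays ([:] reads the dataset into memory)
X_train, y_train = h5f["X_train"][:], h5f["y_train"][:]
X_test, y_test = h5f["X_test"][:], h5f["y_test"][:]
h5f.close()
print(X_train.shape, X_test.shape)
```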

2. Data Visualisation and preprocessing

A. Print shape of all the 4 data split into x, y, train, test to verify if x & y is in sync.

* X_TRAIN CONTAINS 42000 RECORDS, AND THE TARGET Y_TRAIN HAS 42000 ELEMENTS.

* X_TEST CONTAINS 18000 RECORDS, AND Y_TEST CONTAINS 18000 ELEMENTS.

* SINCE THERE ARE THE SAME NUMBER OF RECORDS AS TARGET LABELS, WE CAN SAY THAT THE DATA IS IN SYNC.

B. Visualise first 10 images in train data and print its corresponding labels.

* WE HAVE DISPLAYED THE FIRST 10 IMAGES FROM X_TRAIN AND THEIR CORRESPONDING LABELS FROM Y_TRAIN.

C. Reshape all the images with the appropriate shape and update the data in the same variable.

* THE SHAPE OF THE TRAINING IMAGES IS 3-DIMENSIONAL.

* WE WILL CHANGE THE IMAGE DATA FROM 3 DIMENSIONS TO 2 DIMENSIONS.

* WE HAVE CHANGED THE DIMENSIONS TO 2-DIMENSIONAL AS BELOW:

(60000, 32, 32) → (60000, 1024)
(42000, 32, 32) → (42000, 1024)
(18000, 32, 32) → (18000, 1024)

D. Normalise the images i.e. Normalise the pixel values.

* IN ORDER TO NORMALIZE THE PIXEL VALUES, WE DIVIDE THEM BY 255, THE MAXIMUM PIXEL VALUE.

* WE CONVERT THE PIXEL VALUES TO A FLOATING-POINT DATA TYPE SO THE DIVISION IS NOT TRUNCATED.
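
The reshape and normalisation steps (C and D) can be sketched in NumPy; the batch size here is a small stand-in for the 42000-image split:

```python
import numpy as np

# Stand-in image batch with the same per-image shape as SVHN (32x32)
X_train = np.random.randint(0, 256, size=(42, 32, 32))

# C. Flatten each 32x32 image into a 1024-long vector: (N, 32, 32) -> (N, 1024)
X_train = X_train.reshape(X_train.shape[0], -1)

# D. Convert to float and scale pixel values from 0..255 into 0..1
X_train = X_train.astype("float32") / 255.0
print(X_train.shape, X_train.min(), X_train.max())
```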

E. Transform Labels into format acceptable by Neural Network

* WE CHANGE EACH LABEL INTO A ONE-HOT ARRAY WHOSE SIZE IS THE TOTAL NUMBER OF UNIQUE CLASSES.

F. Print total Number of classes in the Dataset.

* THERE ARE TOTAL OF 10 CLASSES IN THE DATASET RANGING FROM 0 TO 9.

3. Model Training & Evaluation using Neural Network

A. Design a Neural Network to train a classifier.

B. Train the classifier using previously designed Architecture (Use best suitable parameters).

* ADAMAX REACHED 100% TRAINING ACCURACY WITH A TEST ACCURACY OF 85.7%.

* ADAMAX GAVE BETTER RESULTS FOR THIS MODEL DESIGN THAN THE SGD MODEL.

INITIAL MODEL:

    epoch      loss  accuracy  val_loss  val_accuracy
    97  0.177493  0.944548  0.996580      0.784722
    95  0.186299  0.940595  0.939283      0.803778
    99  0.191058  0.940214  0.948085      0.793222
    93  0.194492  0.937738  1.065166      0.771833
    94  0.194438  0.937690  0.999405      0.791389



FINAL MODEL:

    epoch      loss  accuracy  val_loss  val_accuracy
    99  0.116148  0.963524  0.966853      0.819278
    97  0.113372  0.963405  1.035892      0.812278
    94  0.118624  0.961714  1.160837      0.761000
    96  0.132339  0.958452  1.932626      0.704778
    92  0.128704  0.958333  0.991548      0.804111


* WE CAN SEE THAT IN THE FINAL MODEL, ACCURACY INCREASED ON BOTH THE TRAINING AND TEST DATA.

* AT MAXIMUM ACCURACY, WE CAN SEE THAT THE LOSS DECREASED IN BOTH CASES.

C. Evaluate performance of the model with appropriate metrics.

* WE PREDICT ON THE TEST DATA AND THEN PRINT THE PREDICTIONS ALONGSIDE THE ACTUAL TEST IMAGES.

* THE FINAL MODEL ACHIEVED AN AVERAGE ACCURACY OF 88.68% ON TRAINING DATA AND 78.53% ON TEST DATA.

* THE AVERAGE LOSS OF THE MODEL IS 0.35 ON TRAINING DATA AND 0.84 ON TEST DATA.

* THE MAXIMUM ACCURACY ACHIEVED BY THE MODEL ON TRAINING DATA IS 96.35%.

* THE MAXIMUM ACCURACY ACHIEVED BY THE MODEL ON TEST DATA IS 81.92%.
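
The evaluation step can be sketched with scikit-learn metrics; the label arrays below are hypothetical stand-ins for `y_test` and the argmax of the model's softmax predictions:

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

y_test = np.array([2, 6, 7, 4, 2, 6])   # actual digit labels (stand-in)
y_pred = np.array([2, 6, 1, 4, 2, 6])   # predicted labels (stand-in)

# Overall accuracy plus per-class precision/recall/F1
acc = accuracy_score(y_test, y_pred)
print(f"test accuracy: {acc:.4f}")
print(classification_report(y_test, y_pred, zero_division=0))
```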

D. Plot the training loss, validation loss vs number of epochs and training accuracy, validation accuracy vs number of epochs plot and write your observations on the same.

* TRAINING LOSS AND ACCURACY FOLLOW SMOOTH, CONSISTENT CURVES (DOWNWARD AND UPWARD RESPECTIVELY).

* TRAINING LOSS DECREASES AS THE NUMBER OF EPOCHS INCREASES.

* TRAINING ACCURACY INCREASES AS THE NUMBER OF EPOCHS INCREASES.

* TEST ACCURACY AND LOSS DO NOT FOLLOW A SMOOTH CURVE AND FLUCTUATE BETWEEN HIGH AND LOW VALUES AT DIFFERENT EPOCHS.

* WE TRAINED THE MODEL WITH DIFFERENT PARAMETERS AND COULD SEE A SLIGHT IMPROVEMENT.
                    **************END OF SOLUTION FOR NEURAL NETWORK PROBLEM STATEMENT*************